W6: Data Wrangling with Tidy Data, Part 2

Announcements

Last Day of Class

Data-thon

  • Explore 3 datasets with Python/R/Whatever tools people want
  • Let us know your interest in the RSVP form (previous page)

Community Session

Last Week Today

Clear Points

The in-class example problems as we went throughout the lesson were really helpful!

The distinction between filter() and select()

Everything you covered.

The graphics were helpful in remembering filtering vs sorting

Muddy Points

I’m curious when you would choose to use tidyverse vs. base R?

  • It mostly comes down to the simplified syntax and pipes. Tidyverse expects data in a certain format (tidy format), and you can do a lot of things with it.
  • There are also specialized packages for dealing with data types ({lubridate} for dates, {stringr} for characters, {forcats} for categorical data)
  • If you don’t like it, there is always Base R.

Muddy Points

Unclear things started to be more clearer gradually, just need to find more time to practice.

Stay curious and ask questions if you can!

couldn’t figure out how to put numeric and logical data together

Are you talking about filtering using both?

Back to Tidyverse and {dplyr} functions!

Function Name Purpose When
select() Selects sets of columns in df Lesson 4
filter() Filters rows in df Lesson 4
mutate() Calculate a New Column in df Lesson 5
group_by()/summarize() Calculate summary statistics across groups Lesson 5
arrange() Sorts a df by one or more columns Lesson 5
_join() Functions to merge two tables together Lesson 5
|> Operation to build pipelines Lesson 4

Modifying and creating new columns

The mutate() function takes in the following arguments: the first argument is the dataframe of interest, and the second argument is a new or existing data variable that is defined in terms of other data variables.

I think of them like formulas in Excel. . . .

metadata |>
  select(ModelID, Age) |>
  head()
metadata2 <- metadata |>
              mutate(olderAge = Age + 10) 
metadata2 |>
  select(ModelID, Age, olderAge) |>
  head()
metadata |>
  mutate(olderAge = Age + 10)

expression data.frame

expression$KRAS_Exp[1:10]
 [1] 4.634012 4.638653 4.032101 5.503031 3.713696 3.972693 3.235727 4.135042
 [9] 9.017365 3.940167
expression2 = expression |>
  mutate(log_KRAS_Exp = log(KRAS_Exp))
expression2$log_KRAS_Exp[1:10]
 [1] 1.533423 1.534424 1.394288 1.705299 1.312028 1.379444 1.174254 1.419498
 [9] 2.199152 1.371223

Try it Out 1

One thing we often need to do is scale or normalize data. Let’s try scaling by the maximum value in a column. Try calculating a new variable KRAS_scaled by calculating:

KRAS_Exp / max(KRAS_Exp)

Try it Out 2

Try calcluating a new variable in mutation called TP53_CDKN2A which is equal to:

  • TP53_Mut | CDKN2A_Mut

Then filter new_mutation by TP53_CDKN2A == TRUE. How many rows did you return?

Grouping and summarizing dataframes

metadata_grouped = group_by(metadata, OncotreeLineage)
metadata_type = summarise(metadata_grouped, MeanAge = mean(Age, na.rm = TRUE), Count = n())
head(metadata_type)

Grouping and summarizing dataframes

The group_by() function returns the identical input dataframe but remembers which variable(s) have been marked as grouped.

The summarise() (you can also use summarize()) returns one row for each combination of grouping variables, and one column for each of the summary statistics that you have specified.

Functions you can use for summarise() must take in a vector and return a simple data type, such as any of our summary statistics functions: mean(), median(), min(), max(), etc.

The exception is n(), which returns the number of entries for each grouping variable’s value.

Try it Out 1

First group by OncotreeLineage. Then try calculating the maximum age as max_age within metadata when it is grouped by OncotreeLineage:

Quick Note

Our summarization has a lot of NAs. We can use the na.rm argument in max() to drop NA values in our calculation.

Try it Out 2

First group by OncotreeLineage. Then try calculating the median age as median_age within metadata when it is grouped by OncotreeLineage. (Hint: use the median() function. Bonus points if you use the na.rm argument)

Think about this:

What is the difference between these two pipelines?

arrange()

If we need to sort by a column, we can use arrange():

metadata |>
  arrange(Age)

Descending order

If we want to sort by descending order, then we can wrap our column in desc() (short for descending):

metadata |>
  arrange(desc(Age))

Sorting by Multiple Columns

We can sort by multiple columns by passing in multiple variables:

metadata |>
  arrange(desc(Age), ModelID)

Try it Out

Try sorting by desc(Age) and OncotreeLineage, and then by OncotreeLineage and desc(Age). Does order matter?

Merging two dataframes together

metadata

ModelID OncotreeLineage Age
“ACH-001113” “Lung” 69
“ACH-001289” “CNS/Brain” NA
“ACH-001339” “Skin” 14

expression

ModelID PIK3CA_Exp log_PIK3CA_Exp
“ACH-001113” 5.138733 1.636806
“ACH-001289” 3.184280 1.158226
“ACH-001339” 3.165108 1.152187

I want to compare the relationship between OncotreeLineage and PIK3CA_Exp:

ModelID PIK3CA_Exp log_PIK3CA_Exp OncotreeLineage Age
“ACH-001113” 5.138733 1.636806 “Lung” 69
“ACH-001289” 3.184280 1.158226 “CNS/Brain” NA
“ACH-001339” 3.165108 1.152187 “Skin” 14

Merging two dataframes together

One strategy we can use is to only merge on common ids between the two tables. If one table has ids that aren’t in the second table, then we’ll remove those rows.

We see that in both dataframes, the rows (observations) represent cell lines with a common column ModelID, so let’s merge these two dataframes together, using inner_join():

merged = inner_join(metadata, expression, by = "ModelID")

Let’s take a look at the dimensions:

dim(metadata)
[1] 1864   30
dim(expression)
[1] 1450  536
dim(merged)
[1] 1450  565

Which rows do we keep from each table?

inner_join() keeps all observations common to both dataframes based on the common column defined via the by argument.

merged |> 
  select(ModelID, KRAS_Exp, OncotreeLineage)

Merging two dataframes together: variations

Given xxx_join(x, y, by = "common_col"),

  • full_join() keeps all observations.

  • left_join() keeps all observations in x.

  • right_join() keeps all observations in y.

  • inner_join() keeps observations common to both x and y.

Try it Out

Try out inner_join() on the two tables and compare the number of rows in the merged table.

Now try full_join() on the two tables and compare the number of rows in the merged table.

Next Week

Learning Community Week!

How did we do?

Weekly Checkin